In this work, we propose an approach to the spatiotemporal localisation (detection) and classification of multiple concurrent actions within temporally untrimmed videos. Our framework is composed of three stages. In stage 1, appearance and motion detection networks are employed to localise and score actions from colour images and optical flow. In stage 2, the appearance network detections are boosted by combining them with the motion detection scores, in proportion to their respective spatial overlap. In stage 3, sequences of detection boxes most likely to be associated with a single action instance, called action tubes, are constructed by solving two energy maximisation problems via dynamic programming. In the first pass, action paths spanning the whole video are built by linking detection boxes over time using their class-specific scores and their spatial overlap; in the second pass, temporal trimming is performed by enforcing label consistency across all constituting detection boxes. We demonstrate the performance of our algorithm on the challenging UCF-101, J-HMDB-21 and LIRIS-HARL datasets, achieving new state-of-the-art results across the board while significantly increasing detection speed at test time. In particular, we report gains of 20% and 11% in mAP (mean average precision) on the UCF-101 and J-HMDB-21 datasets respectively over the previous state of the art.
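As an illustration of the first-pass linking step, the sketch below shows one way class-specific detection scores and inter-frame spatial overlap could be combined in a Viterbi-style dynamic programme to select one box per frame and form an action path. The names and the overlap weighting (link_action_path, iou, overlap_weight) are illustrative assumptions for this sketch, not the authors' implementation.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def link_action_path(boxes, scores, overlap_weight=1.0):
    """Viterbi-style pass linking one detection per frame into an action path.

    boxes:  list over frames, each an array of shape (n_t, 4)
    scores: list over frames, each an array of shape (n_t,) holding the
            class-specific detection scores for one action class
    Returns the index of the chosen box in every frame.
    """
    T = len(boxes)
    # Accumulated path score for every box in the current frame,
    # plus back-pointers for recovering the best path afterwards.
    acc = scores[0].astype(float)
    back = []
    for t in range(1, T):
        n_prev, n_cur = len(boxes[t - 1]), len(boxes[t])
        # Pairwise transition scores: spatial overlap between consecutive boxes.
        trans = np.array([[iou(boxes[t - 1][i], boxes[t][j])
                           for j in range(n_cur)] for i in range(n_prev)])
        total = acc[:, None] + overlap_weight * trans   # shape (n_prev, n_cur)
        back.append(total.argmax(axis=0))               # best predecessor per box
        acc = total.max(axis=0) + scores[t]
    # Backtrack from the highest-scoring box in the last frame.
    path = [int(acc.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]
```

In the full method this maximisation is run per action class, and a second pass over the resulting path enforces label consistency to trim the action temporally.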